import json
import geopandas as gpd
import pandas as pd
import numpy as np
import folium
from folium.plugins import DualMap
# Display the combined maps (e.g., in a Jupyter Notebook or save to HTML file)
from IPython.display import HTML
# Libraries for the machine learning model and Folium map
from sklearn.model_selection import train_test_split
# from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from folium.features import GeoJson, GeoJsonTooltip
Abstract
Leveraging an XGBoost regression model, I accurately predicted average home prices across the state of Texas with an R² score of 0.85 and a Root Mean Squared Error (RMSE), Mean Squared Error (MSE), and Mean Absolute Error (MAE) all at or below 0.35 (on the log scale of price). Utilizing U.S. Census ZIP code boundary data, the results were visualized in a Folium choropleth map showing that the most affordable housing can be found in the high/northwest plains, upper/southeast, and South Texas regions. This project provides a comprehensive, step-by-step guide to the modeling process, from data preprocessing to final evaluation.
Show the code
# Displaying my dual map
HTML("ml_datasets/dual_map.html")
Average Price Prediction (Log Scale)
Actual Average Price (Log Scale)
Home Prices in Texas
What is the average home price by Zipcode in Texas?
We know house prices continue to skyrocket across the nation, making it more difficult for the rising generation of Americans to afford a home. In Texas, the “median home prices rose by about 40 percent between 2019 and 2023…” (Green 2024). We know these overall statistics, but which areas have the highest home prices? And how do we measure the boundaries of an “area”? This information would be beneficial for anyone looking to buy a home in the near future somewhere in Texas, whether in humid Houston or dry Amarillo, and it would be very useful for Texas residents to see the average home price for a particular bounded “area”. Better yet, can we predict the price of a home in Texas from household data such as land space, square footage, number of bedrooms, etc.? How accurate a model can we create? In this study, my aim was to develop a model that can confidently predict the price of a home in Texas based on household data. The company Barking Data has provided a dataset on homes for sale across the United States (BarkingData 2022), which appears to cover homes being sold in the year 2022; it is difficult to come across the kind of information Barking Data so thoughtfully provided. Here is a sample of the household data attributes provided in the dataset:
address
street_name
apartment
city
latitude
longitude
postcode
price
bedroom_number
bathroom_number
price_per_unit
living_space
land_space
property_type
property_status
I cleaned up the dataset by first filtering the data to only look at the state of Texas (the target state). Furthermore, I was only interested in properties that were “For Sale”. I noticed that some homes had a living space of only 0, 1, 2, or 3 square feet, so I removed those rows from my dataset. I also filtered my dataset to properties that were apartments, condos, manufactured homes, multi-family homes, single-family homes, and townhouses; I removed any lots being sold because I was interested in homes only. Land space cannot be negative, and I noticed some negative land space values, so I filtered out those rows as well. After using other filtering methods and one-hot encoding, I ended up with the following dataset (sample shown):
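To illustrate what the one-hot encoding step produces, here is a minimal sketch using a toy `property_type` column (the values below are made-up examples, not rows from the actual dataset): `pd.get_dummies` expands each category into its own 0/1 indicator column.

```python
import pandas as pd

# Toy data standing in for the real property_type column (values are assumptions)
toy = pd.DataFrame({'property_type': ['CONDO', 'SINGLE_FAMILY', 'CONDO', 'TOWNHOUSE']})

# Each category becomes its own 0/1 indicator column, e.g. property_type_CONDO
encoded = pd.get_dummies(toy, columns=['property_type'], dtype='int')

print(encoded.columns.tolist())
# → ['property_type_CONDO', 'property_type_SINGLE_FAMILY', 'property_type_TOWNHOUSE']
print(encoded['property_type_CONDO'].tolist())
# → [1, 0, 1, 0]
```

This is why the cleaned dataframe below has one `property_type_*` column per category, with exactly one 1 per row.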
Show the code
# Here I am reading in my data
df = pd.read_csv("./ml_datasets/US Homes Data (For Sale Properties).csv")

# I filter to only keep data from the state of Texas, specifically for homes that are for sale
df = df[(df["state"].isin(["TX"])) & (df["property_status"].isin(["FOR_SALE"])) & (~df["living_space"].isin([0, 1, 2, 3]))]

# I only look at apartment, condo, manufactured, multi-family, single-family, and townhouse homes
df = df[~df['property_type'].isin(['LOT'])]

# I filter out the rows with negative land space values
df = df[~df['land_space'].isin([-10890.0])]

# Here I change acres to square feet in the land_space column
df['land_space'] = round(df.apply(lambda row: row['land_space'] * 43560 if row['land_space_unit'] == 'acres' else row['land_space'], axis=1), 2)

# Here I make the price_per_unit column more accurate
df['price_per_unit'] = round(df['price'] / df['living_space'], 2)
df['price_per_unit'] = df.apply(lambda row: round(row['price'] / row['living_space'], 2) if (row['living_space'] > 0) and not pd.isna(row['living_space']) else row['price_per_unit'], axis=1)

# Here I drop the extra columns that I do not need, and can drop now
df = df.drop(['property_url', 'property_id', 'address', 'street_name', 'apartment', 'city', 'state', 'land_space_unit', 'broker_id', 'property_status', 'year_build', 'total_num_units', 'listing_age', 'RunDate', 'agency_name', 'agent_name', 'agent_phone', 'latitude', 'longitude', 'price_per_unit'], axis=1).reset_index(drop=True)

# One-hot encoding
df = pd.get_dummies(df, columns=['property_type'], dtype='int')

# Here I rename my postcode column to zip_code
df.rename(columns={'postcode': 'zip_code'}, inplace=True)
df.head(10)
|   | zip_code | price | bedroom_number | bathroom_number | living_space | land_space | is_owned_by_zillow | property_type_APARTMENT | property_type_CONDO | property_type_MANUFACTURED | property_type_MULTI_FAMILY | property_type_SINGLE_FAMILY | property_type_TOWNHOUSE |
|---|----------|-------|----------------|-----------------|--------------|------------|--------------------|-------------------------|---------------------|----------------------------|----------------------------|-----------------------------|--------------------------|
| 0 | 79903 | 239500.0 | 5.0 | 3.0 | 1692.0 | 6969.6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 79925 | 165000.0 | 4.0 | 2.0 | 1650.0 | 12632.4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 79905 | 118000.0 | 4.0 | 1.0 | 1918.0 | 11325.6 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 79903 | 414700.0 | 4.0 | 3.0 | 3119.0 | 15246.0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 79905 | 260000.0 | NaN | NaN | 3267.0 | 6000.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 79925 | 174950.0 | 4.0 | 2.0 | 1800.0 | 6098.4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 79903 | 950000.0 | 8.0 | 4.0 | 6100.0 | 31363.2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 79903 | 165000.0 | 3.0 | 1.0 | 1305.0 | 6098.4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 79903 | 249000.0 | 3.0 | 3.0 | 2790.0 | 6098.4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9 | 79925 | 179000.0 | 5.0 | 2.0 | 2112.0 | 6098.4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Machine Learning Preparation
In order to finish preparing my dataframe for a machine learning model, I used a dataset provided on GitHub that contains the geometric shape of each ZIP code (“State-Zip-Code-GeoJSON”). This dataframe was used to extract the centroid latitude and centroid longitude for the machine learning model I ended up using: XGBoost regression. I later used this dataframe again to extract the geometry column, which contains the geometric shapes I needed to create a Folium choropleth map showing how well my model predicts home prices. Dataframe sample below:
Show the code
# Machine learning preparation.
# Load the GeoJSON shape file for Texas ZIP codes
url = "https://raw.githubusercontent.com/OpenDataDE/State-zip-code-GeoJSON/refs/heads/master/tx_texas_zip_codes_geo.min.json"
zip_shapes = gpd.read_file(url)

# Make sure ZIP codes are strings (important for matching)
df['zip_code'] = df['zip_code'].astype(str)
zip_shapes['ZCTA5CE10'] = zip_shapes['ZCTA5CE10'].astype(str)

# Here I do a left join (everything from df and only matching rows from zip_shapes)
new_df = pd.merge(df, zip_shapes, left_on='zip_code', right_on='ZCTA5CE10', how='left')

# Here I drop the extra columns that I do not need
ml_clean = new_df.drop(['STATEFP10', 'ZCTA5CE10', 'GEOID10', 'CLASSFP10', 'MTFCC10', 'FUNCSTAT10', 'ALAND10', 'AWATER10', 'INTPTLAT10', 'INTPTLON10', 'PARTFLG10'], axis=1)

# Tell GeoPandas which column holds the geometry
ml_clean = gpd.GeoDataFrame(ml_clean, geometry='geometry')

# Check the current coordinate reference system (CRS)
# print(ml_clean.crs)

# Extract spatial features: centroid coordinates of each ZIP code polygon
ml_clean['zip_centroid_lon'] = ml_clean.geometry.centroid.x
ml_clean['zip_centroid_lat'] = ml_clean.geometry.centroid.y

# Here I drop the geometry column (not needed for modeling)
ml_clean = ml_clean.drop(columns=['geometry'])
zip_shapes.head(10)
|   | STATEFP10 | ZCTA5CE10 | GEOID10 | CLASSFP10 | MTFCC10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | PARTFLG10 | geometry |
|---|-----------|-----------|---------|-----------|---------|------------|---------|----------|------------|------------|-----------|----------|
| 0 | 48 | 75801 | 4875801 | B5 | G6350 | S | 555807428 | 6484251 | +31.7345202 | -095.5313809 | N | POLYGON ((-95.68072 31.728, -95.68086 31.72807... |
| 1 | 48 | 75839 | 4875839 | B5 | G6350 | S | 381863557 | 3928555 | +31.6066930 | -095.5833403 | N | POLYGON ((-95.66269 31.64586, -95.66242 31.645... |
| 2 | 48 | 78336 | 4878336 | B5 | G6350 | S | 126923194 | 31624523 | +27.9269126 | -097.1777757 | N | POLYGON ((-97.19642 27.91194, -97.19618 27.912... |
| 3 | 48 | 76389 | 4876389 | B5 | G6350 | S | 515265783 | 10157866 | +33.5126278 | -098.4576268 | N | MULTIPOLYGON (((-98.37035 33.44176, -98.3782 3... |
| 4 | 48 | 76310 | 4876310 | B5 | G6350 | S | 465637930 | 24888388 | +33.7990847 | -098.5098400 | N | POLYGON ((-98.50723 33.84418, -98.50234 33.844... |
| 5 | 48 | 78011 | 4878011 | B5 | G6350 | S | 493710052 | 579968 | +28.7909153 | -098.7176338 | N | POLYGON ((-98.74144 28.94192, -98.74292 28.942... |
| 6 | 48 | 78114 | 4878114 | B5 | G6350 | S | 950244404 | 4879176 | +29.1147177 | -098.2086610 | N | POLYGON ((-98.0271 29.10656, -98.02895 29.1028... |
| 7 | 48 | 78052 | 4878052 | B5 | G6350 | S | 98193838 | 285316 | +29.2036790 | -098.7766943 | N | POLYGON ((-98.8381 29.22257, -98.83821 29.2268... |
| 8 | 48 | 79371 | 4879371 | B5 | G6350 | S | 728089826 | 139271 | +33.9526210 | -102.9662131 | N | MULTIPOLYGON (((-102.4525 34.08832, -102.45252... |
| 9 | 48 | 78650 | 4878650 | B5 | G6350 | S | 172543341 | 488366 | +30.3038849 | -097.2188991 | N | POLYGON ((-97.25978 30.2201, -97.25981 30.2201... |
Results
It is important to note that the geometry column gets dropped, since it is not needed by the machine learning model. However, as mentioned earlier, before dropping it I derived the centroid longitude and centroid latitude from the geometry column to use in the model. The geometry column is used again after training in order to compare the model's predictions against the test data in a Folium choropleth map. Below you will find the clean final dataframe (sample) that was used by my machine learning model. It is important to note that the target variable used for the model was the log of 1 plus the “price” column (the home “price” is represented as “x” in the equation):
\[
\text{target column} = \log\left(1+x\right)
\]
The reason I chose this as my target column is that the outliers in the price column were heavily skewing my Folium choropleth legend scale, making it unreadable. By changing the scale of the target column (house price), the legend for each Folium choropleth map comes out balanced and easy to read.
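The transform and its inverse can be sketched in a few lines (the prices below are made-up examples, not rows from the dataset): `np.log1p` compresses the range so outliers no longer dominate the color scale, and `np.expm1` recovers dollars exactly for display.

```python
import numpy as np

# Made-up home prices in dollars, including a large outlier
prices = np.array([118000.0, 250000.0, 950000.0, 25000000.0])

# log1p compresses the range so the outlier no longer dominates the scale
log_prices = np.log1p(prices)

# expm1 inverts the transform, recovering dollars for display (e.g., in tooltips)
recovered = np.expm1(log_prices)
assert np.allclose(recovered, prices)
```

The round trip is lossless up to floating-point precision, which is why the maps can color by log price but report dollar prices in the tooltips.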
I ended up going with the XGBoost regression model because it can handle features that contain missing data, and I used the regressor variant of XGBoost because the target variable is continuous rather than categorical. When splitting my data for training and testing I used a test size of 0.20, and I set a random state of 42 both in the split and when creating the model instance to help with reproducibility.
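The idea behind `test_size=0.20` with a fixed `random_state` can be sketched without scikit-learn. This is an illustration of the concept only, not scikit-learn's actual implementation; the `toy_split` helper is a hypothetical name introduced here.

```python
import numpy as np

def toy_split(n_rows, test_size=0.20, random_state=42):
    """Shuffle row indices with a fixed seed and carve off a test_size fraction."""
    rng = np.random.default_rng(random_state)  # fixed seed -> same shuffle every run
    idx = rng.permutation(n_rows)
    n_test = int(round(n_rows * test_size))
    return idx[n_test:], idx[:n_test]  # (train indices, test indices)

train_idx, test_idx = toy_split(100)
print(len(train_idx), len(test_idx))  # → 80 20

# Same seed produces the identical split, which is what makes results reproducible
train_again, test_again = toy_split(100)
assert (test_idx == test_again).all()
```

With a fixed seed, anyone re-running the notebook evaluates the model on the same held-out 20% of rows, so the reported metrics are comparable across runs.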
Let’s go over the evaluation metrics I used for my model. The Mean Squared Error (MSE) measures the average of the squared differences between the predicted and actual values (see equation below):

\[
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
\]

This metric is sensitive to outliers. We can see from my results below that the MSE was 0.12, which is low.
The Root Mean Squared Error (RMSE) is the square root of the Mean Squared Error (MSE), which means it brings the units back to those of the target variable (the log scale of the home price column in this case). This metric provides the average error size. Please see the equation below:

\[
\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}
\]

We can see that the Root Mean Squared Error (RMSE) is 0.35, which is again low.
The Mean Absolute Error (MAE) measures the average error without considering the direction (negative or positive). Please see the equation below:

\[
\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
\]
Our Mean Absolute Error is 0.22, which is low.
Finally, we look at the R² score. This measures the proportion of the variance in the dependent (target) variable that is predictable from the independent variables. R² values typically fall within the range 0 to 1 (0% to 100%); the closer the value is to 1, the closer the model is to a perfect fit — in other words, the better the model explains the variability in the target variable. Please see the equation below:

\[
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}
\]
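The four metrics can be verified by hand on toy numbers (these values are illustrative only, not the model's actual output), which makes the formulas concrete:

```python
import numpy as np

# Toy values on a log-price scale (made-up numbers, not the model's output)
y_true = np.array([2.0, 4.0, 6.0])
y_pred = np.array([2.5, 4.0, 5.0])

mse = np.mean((y_true - y_pred) ** 2)        # average squared error
rmse = np.sqrt(mse)                          # back to the target's units
mae = np.mean(np.abs(y_true - y_pred))       # average absolute error
ss_res = np.sum((y_true - y_pred) ** 2)      # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                     # fraction of variance explained

print(round(mse, 3), round(rmse, 3), round(mae, 3), round(r2, 3))
# → 0.417 0.645 0.5 0.844
```

Note how the single larger error (1.0 on the third row) pulls MSE and RMSE up more than MAE, which is exactly the outlier sensitivity described above.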
Below you will see the dataframe that was fed to the XGBoost model for training and testing (X represents the features; y represents the target variable). You will notice the change in the scale of the target variable to a log scale, as previously mentioned. Below that you will see the results of the metrics used to evaluate the model, as discussed.
Show the code
# Drop zipcode for model training
X = ml_clean.drop(columns=['price'])
y = ml_clean['price']

# Here I make the target variable a log to improve my machine learning model performance
y = np.log1p(y)

X_train_full, X_test_full, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Here I save the zipcodes separately before I drop them for modeling
# Zipcodes aligned to my training set
zip_train = X_train_full['zip_code'].reset_index(drop=True)
# Zipcodes aligned to my test set
zip_test = X_test_full['zip_code'].reset_index(drop=True)

# Here I remove the zipcodes from the actual training data
X_train = X_train_full.drop(columns=['zip_code'])
X_test = X_test_full.drop(columns=['zip_code'])

# Create model instance
xgb = XGBRegressor(n_estimators=350, max_depth=10, learning_rate=0.01, random_state=42)

# Fit model
xgb.fit(X_train, y_train)

# Make predictions
y_pred = xgb.predict(X_test)

print("X-Features Sample:")
X.head(10)
print("Mean Squared Error (MSE):", round(mean_squared_error(y_test, y_pred), 2))
rmse = round(np.sqrt(mean_squared_error(y_test, y_pred)), 2)
print("Root Mean Squared Error (RMSE):", round(rmse, 2))
print("Mean Absolute Error (MAE):", round(mean_absolute_error(y_test, y_pred), 2))
print("R² Score:", round(r2_score(y_test, y_pred), 2))

results = pd.DataFrame({'zip_code': zip_test, 'prediction': y_pred, 'actual_price': y_test.reset_index(drop=True)})

# Here I group by the zip_code column and calculate the mean of the prediction and actual values
results = results.groupby('zip_code')[['prediction', 'actual_price']].mean()

# Round to two decimal places
results['actual_price'] = results['actual_price'].round(2)

# Here I rename my prediction column and actual price column
results = results.rename(columns={'prediction': 'average_price_prediction_log', 'actual_price': 'actual_average_price_log'})
results = pd.DataFrame(results).reset_index()

# Undo the log1p transformation to get the actual price predictions in separate columns
results['average_price_prediction'] = np.expm1(results['average_price_prediction_log'])
results['actual_average_price'] = np.expm1(results['actual_average_price_log'])

# Here I do a left join (everything from results and only matching rows from zip_shapes)
results = pd.merge(results, zip_shapes, left_on='zip_code', right_on='ZCTA5CE10', how='left')

# Here I drop the extra columns that I do not need
results = results.drop(['STATEFP10', 'ZCTA5CE10', 'GEOID10', 'CLASSFP10', 'MTFCC10', 'FUNCSTAT10', 'ALAND10', 'AWATER10', 'INTPTLAT10', 'INTPTLON10', 'PARTFLG10'], axis=1)

# Tell GeoPandas which column holds the geometry
results = gpd.GeoDataFrame(results, geometry='geometry')

# Simplify geometry to reduce file size.
# tolerance controls how much simplification is applied: higher values remove more vertices,
# lower values retain more detail. Here 0.01 is a relatively small simplification, so fine
# detail is mostly preserved.
# preserve_topology=True ensures the simplified geometry stays valid — it prevents things
# like self-intersecting polygons, holes merging or disappearing incorrectly, and borders
# of neighboring shapes overlapping or separating when they shouldn't.
results['geometry'] = results['geometry'].simplify(tolerance=0.01, preserve_topology=True)
Mean Squared Error (MSE): 0.12
Root Mean Squared Error (RMSE): 0.35
Mean Absolute Error (MAE): 0.22
R² Score: 0.85
Folium Choropleth - Map Visualization (Predicted vs Actual)
Now let's talk about the results in the Folium choropleth map. You will notice that the dataset did not provide every ZIP code in the state of Texas, which is evident from the missing boundaries; however, there was plenty of data available. After obtaining the results of the test data from the XGBoost model, I aggregated the data by grouping by ZIP code and taking the average home price (in log scale) for each ZIP code. On the map to the left you will find the average price predicted by the XGBoost model for each ZIP code, and on the map to the right you will find the actual average price for each ZIP code. As you can see, the model does a great job of predicting average home prices: the results are nearly identical to the actual averages shown on the right. It is difficult to interpret a home price measured on a log scale, which is why hovering over any boundary on either map shows the predicted or actual average price converted back to U.S. dollars.
Show the code
# Create a map centered over Texas
m1 = folium.Map(location=[31.9686, -99.9018], zoom_start=5.25, tiles="openstreetmap")

# Add the choropleth (color-coded layer)
folium.Choropleth(
    # The GeoJSON data that contains the shapes of the ZIP code areas
    geo_data=results,
    # The data that I want to visualize
    data=results,
    # The columns to use: the column that ties to the geographic shapes
    # and the column with the data to visualize
    columns=['zip_code', 'average_price_prediction_log'],
    # The path in the GeoJSON structure to the property (key) that holds
    # the value to match with the data DataFrame
    key_on='feature.properties.zip_code',
    # The color scale used to fill each area
    fill_color='plasma',
    # The transparency of the filled areas
    fill_opacity=0.7,
    # The transparency of the boundary lines between the geographic shapes
    line_opacity=0.45,
    legend_name="Average Price Prediction (Log Scale - Smallest to Largest)"
).add_to(m1)

tooltip = GeoJson(
    results,
    style_function=lambda x: {
        # Don't add a fill color (we're already coloring with the choropleth)
        'fillColor': 'transparent',
        # Hide the border line color
        'color': 'transparent',
        # No border line thickness
        'weight': 0
    },
    tooltip=GeoJsonTooltip(
        # Columns to display in the tooltip
        fields=['zip_code', 'average_price_prediction'],
        # Labels to display for the fields (instead of the column names)
        aliases=['ZIP Code:', 'Average Price Prediction:'],
        # Formats numbers using local formatting (e.g., commas in large numbers)
        localize=True,
        # The tooltip "sticks" to your mouse as you move around each shape — nice UX
        sticky=True,
        # Show the field labels (the aliases defined above)
        labels=True
    )
).add_to(m1)

# Same map again, this time for the actual average prices
m2 = folium.Map(location=[31.9686, -99.9018], zoom_start=5.25, tiles="openstreetmap")

folium.Choropleth(
    geo_data=results,
    data=results,
    columns=['zip_code', 'actual_average_price_log'],
    key_on='feature.properties.zip_code',
    fill_color='plasma',
    fill_opacity=0.7,
    line_opacity=0.45,
    legend_name="Actual Average Price (Log Scale - Smallest to Largest)"
).add_to(m2)

tooltip = GeoJson(
    results,
    style_function=lambda x: {
        'fillColor': 'transparent',
        'color': 'transparent',
        'weight': 0
    },
    tooltip=GeoJsonTooltip(
        fields=['zip_code', 'actual_average_price'],
        aliases=['ZIP Code:', 'Actual Average Price:'],
        localize=True,
        sticky=True,
        labels=True
    )
).add_to(m2)

# Save the maps as separate HTML files
# m1.save("ml_datasets/predicted_price_map.html")
# m2.save("ml_datasets/actual_price_map.html")

# Embed maps side-by-side in HTML
html_string = """
<div style="display: flex;">
  <div style="width: 50%; height: 100%;">
    <h3>Average Price Prediction (Log Scale)</h3>
    {m1}
  </div>
  <div style="width: 50%; height: 100%;">
    <h3>Actual Average Price (Log Scale)</h3>
    {m2}
  </div>
</div>
""".format(m1=m1._repr_html_(), m2=m2._repr_html_())

with open("ml_datasets/dual_map.html", "w") as f:
    f.write(html_string)

dual_map = HTML(html_string)
dual_map
Average Price Prediction (Log Scale)
Actual Average Price (Log Scale)